Abstract: 4-Coumarate:CoA ligases (4CLs) are a group of essential enzymes involved in 
the metabolic pathway of phenylpropanoid-derived compounds; however, it is still difficult to 
identify orthologs and paralogs of these important enzymes based only on sequence similarity 
of the conserved domains. Using sequence data of 20 plant species from the public databases 
and sequences from Lonicera japonica, we define 1252 adenosine monophosphate 
(AMP)-dependent synthetase/ligase sequences and classify them into three phylogenetic 
clades. Based on this partitioning and on known proteins characterized in Arabidopsis thaliana 
and Oryza sativa, 4CLs fall into one of four subgroups. We also defined 184 non-redundant 
sequences that encode proteins containing the GEICIRG motif and the taxonomic 
distribution of these GEICIRG-containing proteins suggests unique catalytic activities in 
plants. We further analyzed their transcription levels in L. japonica and L. japonica var. 
chinensis flowers and chose the most highly expressed genes representing the subgroups for 
structure and binding site predictions. Coupled with liquid chromatography-mass 
spectrometry (LC-MS) analysis of the L. japonica flowers, the structural study of the putative 
substrate-binding amino acid residues in the conserved binding site of LJ4CL1, with ferulate 
and 4-coumaric acid as candidate substrates, leads to the conclusion that this highly expressed 
protein group in the flowers may process 4-coumarate, which accounts for 90% of the substrates of 
the known phenylpropanoid-derived compounds. The activity of purified crude LJ4CL1 protein was 
analyzed using 4-coumarate as the substrate, and high activity was observed, indicating that 
4-coumarate is one of the substrates of LJ4CL1. 

Keywords: 4-coumarate:CoA ligase; phenylpropanoid-derived compounds; Lonicera japonica; 
phylogeny 



1. Introduction 

4-Coumarate:CoA ligases (4CLs, EC 6.2.1.12) are a group of essential enzymes involved in the 
phenylpropanoid-derived compound (PDC) pathway, which converts hydroxylated cinnamic acids into 
their corresponding thioesters [1]. The PDC pathway as well as its branch pathways generates various 
classes of secondary compounds, including lignin, flavones, flavonols, anthocyanins, isoflavonoids, 
and furanocoumarins [2]. PDCs, as a group of the ubiquitous plant secondary metabolites, control 
flower color, pollination, and stress response [3]. In medicinal plants, certain PDCs have important 
functions, such as anti-inflammatory, anti-tumor, and anti-human immunodeficiency virus activity [4]. 

Due to the importance of phenylpropanoid-derived products in plants, 4CL, characterized as 
a member of a large AMP-binding protein family, has been studied extensively for nearly four 
decades [5]. However, it is still difficult to identify 4CLs based only on sequence similarity of conserved 
domains. Although functionality can be deduced from the domain composition of proteins and 
enzymes [6], a detailed domain analysis of 4CLs is still lacking. Their first signature domain 
(Box I) consists of a serine/threonine/glycine (STG)-rich domain followed by a proline/lysine/glycine 
(PKG) triplet [7], whereas the second signature domain contains a GEICIRG motif (Box II) [8]. The 
4CL-catalyzed CoA ester formation takes place via a two-step reaction. In the first step, 4-coumarate 
and ATP form a coumaroyl-adenylate intermediate with simultaneous release of pyrophosphate. 
In the second step, the coumaroyl group is transferred to the sulfhydryl group of CoA, and AMP is 
subsequently released [8]. The mechanism of an adenylate intermediate formation is also common 
among a number of other enzymes with divergent functions, including luciferases, fatty acyl-CoA 
ligases, acetyl-CoA ligases, and the specialized domains of peptide synthetase multienzymes. Despite 
their low overall amino acid sequence identity, similar reaction mechanisms and the presence of 
conserved peptide motifs are used to classify 4CLs into a superfamily of adenylate-forming 
enzymes [9]. The relationship of 4CLs with other adenylate-forming enzymes was recently substantiated 
by a functional analysis of those key 4CL amino acid residues conserved in other adenylate-forming 
enzymes [10]. Phylogenetic analyses of the superfamily of adenylate-forming enzymes show that 
4CL forms a monophyletic plant-specific group more closely related to luciferases than to the 
long-chain acyl-CoA ligases and acetyl-CoA ligases [11]. However, Souza et al. [12] reported 
an acyl-CoA synthetase gene that is related to 4CL, although it encodes a novel fatty acyl-CoA synthetase. 
An in silico analysis revealed that the Arabidopsis genome has 14 genes annotated as putative 
4-coumarate:CoA ligase isoforms or homologs. Of these genes, only four are catalytically active 
in vitro, with broad substrate specificities [13], and the functions of the others are yet to be characterized. 
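
For reference, the two-step mechanism described above can be written as a pair of reaction equations. This is only a schematic summary of the text; the enzyme is implicit on both sides, and the compound names are the standard ones for the 4-coumarate case.

```latex
\begin{align*}
\text{4-coumarate} + \text{ATP} &\longrightarrow \text{4-coumaroyl-AMP} + \text{PP}_i \\
\text{4-coumaroyl-AMP} + \text{CoA-SH} &\longrightarrow \text{4-coumaroyl-CoA} + \text{AMP}
\end{align*}
```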

4CLs are often present as multiple isoforms that exhibit distinct substrate specificities and coincide 
with specific metabolic functions (Figure S1). Substrates of 4CLs include sinapic acid, 
5-hydroxyferulate, ferulic acid, caffeic acid, 4-coumarate, and trans-cinnamic acid. Three 4CL isozymes 
from Sorbus aucuparia L. prefer 4-coumaric acid over cinnamic acid in spectrophotometric assays, 
but fail to utilize benzoic acid in radioisotopic assays [14]. Allina et al. [15] confirmed that multiple 
4CL isoforms are present in poplar tissues. However, there has been no evidence to support differences 
in the substrate-utilization profiles of the partially purified native 4CL isoforms or of the two isoforms 
expressed in recombinant form. Three of the 4CLs from the bryophyte Physcomitrella patens 
display similar substrate utilization profiles, with high catalytic efficiency towards 4-coumarate but 
with an efficiency towards cinnamate similar to that towards caffeate and ferulate [16]. All of these 
substrates except sinapate are efficiently activated by 4CLs from various sources [17]. Recombinant Ocimum sanctum 
4CL showed the highest activity with p-coumaric acid, followed by ferulic, caffeic, and trans-cinnamic 
acids [18]. One of the Petunia 4CLs has broad substrate specificity and represents a bona fide 4CL, 
whereas the other is a cinnamate:CoA ligase [19]. The crystal structures of 4CLs from 
Arabidopsis thaliana [20] and Populus tomentosa [21] have already been reported. Information 
regarding 4CL specificity may facilitate predicting substrate preference for the characterization of 
4CL-like proteins [22]. Divergent substrate preference also affects the expression of 4CL genes [1]. A 
differential transcription pattern of each 4CL in various organs and tissues, as well as distinct temporal 
patterns of expression, has been observed during flower and fruit development of raspberry [23]. 
The controlled silencing of At4CL1 and At4CL2 alters the lignocellulose composition of Arabidopsis 
without affecting its stem growth [24]. By contrast, severe suppression of 4CLs in the coniferous 
Pinus radiata substantially affects plant phenotype and results in dwarfed plants [25]. 

The major active PDCs in Lonicera japonica are flavones and flavonols, including chlorogenic acids 
(CGAs) and luteoloside [26-28]. In this study, we aim to determine the characteristics and function of 
4CLs in L. japonica. Recently, a number of 4CLs have been characterized in A. thaliana and Oryza sativa 
based on transcriptomic studies [29]. Here we report the identification and characterization of LJ4CLs 
and propose the relationship of LJ4CL function and the related active compounds in L. japonica, based 
on expression data [30], protein structure analysis, and substrate characterization. 

2. Results and Discussion 

2.1. Global Phylogeny and Duplication of AMP-Binding Proteins 

Using Pfam (AMP-binding enzyme PF00501) and InterPro (IPR000873 and IPR020845), as well as 
information from public genome databases and our own transcriptome databases of L. japonica, we 
gathered 1252 non-redundant sequences that encode AMP-binding proteins from 20 different species, 
representing a diverse taxonomic background (Table S1). The result shows that AMP-binding proteins 
are widely distributed among bacteria, fungi, animals, and plants. Among 1252 sequences, 
146 putative AMP-binding protein sequences are identified in A. lyrata as compared to 46 and 
10 AMP-binding proteins in Culex quinquefasciatus and Escherichia coli, respectively (Table 1). 

Table 1. Copy number of AMP-binding domains in 20 species. 

We classified all AMP-binding protein sequences into three clusters, where 82% of them are 
in Cluster 1. The gymnospermae sequences are all found in Cluster 1, whereas those of pteridophyta, algae, 
monocotyledoneae, and dicotyledoneae are divided between Clusters 1 and 2. A few sequences from 
A. lyrata and L. japonica Thunb. var. chinensis (Wats.) are in Cluster 3. Cluster 1 has four subgroups. 
According to the known function of the proteins in A. thaliana and O. sativa, we speculate that 
long-chain acyl-CoA synthase (ACS) belongs to Subgroup 1 and that acyl-activating enzyme (AAE), 
o-succinylbenzoate-CoA ligase, and benzoate-CoA ligase are in Subgroup 2. 4CL is expected to be in 
Subgroup 3, and Subgroup 4 includes acyl-CoA and malonyl-CoA synthases. The predicted 4CL 
group matches the true 4CL and 4CL-like enzymes identified in a genome-wide analysis of the land 
plant-specific acyl:coenzyme A synthetase (ACS) gene family in Arabidopsis, poplar, rice, and 
Physcomitrella [31]. From the neighbor-joining (NJ) trees, we also found that the gymnospermae 
sequences are clustered only in Subgroup 3, but the algal sequences are not (Figure S2). 

2.2. Global Phylogeny and Duplication of GEICIRG-Containing Proteins 

In the species whose genomes are completely sequenced, there are 184 non-redundant sequences 
that encode GEICIRG-containing proteins, which are unique to Plantae, including gymnospermae, 
algae, bryophyta, pteridophyta, and angiospermae (Table 2). The GEICIRG motif is absolutely 
conserved in all 4CLs, and its central cysteine residue is suggested to be directly involved in 
catalysis [8]. The participation of a cysteine residue in catalysis has also been observed for other 
adenylate-forming enzymes [32]. 
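
As an illustration of how motif-containing sequences can be flagged, the following minimal sketch scans protein sequences for an exact GEICIRG match. It is not the procedure used in this study (Section 3.1 describes a BLAST-based search); the function name and the toy sequences are hypothetical.

```python
import re

# Box II motif as given in the text. Box I is not modeled here because the
# text describes it only loosely (an STG-rich stretch followed by PKG).
BOX_II = re.compile(r"GEICIRG")


def find_box_ii(sequences):
    """Return {sequence_id: [0-based start positions]} for motif-containing sequences."""
    hits = {}
    for seq_id, seq in sequences.items():
        positions = [m.start() for m in BOX_II.finditer(seq.upper())]
        if positions:
            hits[seq_id] = positions
    return hits


# Toy sequences (hypothetical, not real proteins).
toy = {
    "seqA": "MSTAKPKGVLLTGEICIRGPQV",  # contains the exact motif
    "seqB": "MSTAKPKGVLLTGEVCVRGPQV",  # two substitutions, so no match
}
motif_hits = find_box_ii(toy)
print(motif_hits)
```

An exact-match scan misses near-variants of the motif by design, which is one reason a similarity search may be preferred in practice.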

Table 2. Copy number of GEICIRG-containing proteins. 

To obtain a global view of the phylogenetic relationships among GEICIRG-containing proteins, 
we first constructed an NJ tree based on their AMP-dependent synthetase/ligase domain sequences 
(Figure S3). These sequences clustered together with strong bootstrap support, and four clades are 
clearly distinguishable, including 4CLs, two ACSs, and one AAE protein as subgroups, based on the 
known functions of the proteins in A. thaliana and O. sativa. 

The phylogenetic reconstructions also revealed that subsequent duplication of proteins containing 
the GEICIRG motif occurred in different lineages. Analysis of these clades (4CLs, ACSs, and AAE) 
in dicotyledoneae, monocotyledoneae, pteridophyta, bryophyta, algae, and gymnospermae suggests 
that gene duplication occurred among the GEICIRG-containing proteins prior to the divergence of 
Angiospermae and Pteridophyta. In the NJ tree, representatives of 4CL and AAE are classified 
into two ACS clades. Most sequences in the AAE clade (Cluster 2) are from Pteridophyta and 
Dicotyledoneae, whereas only one copy is found in Sorghum bicolor. 

The phylogeny based on AMP-dependent synthetase/ligase domain sequences (Figures S4 and S5) 
revealed that ACS proteins are classified into two clusters (1 and 4). In dicots, the lineage-specific 
duplication events in Cluster 1 are similar to those in Cluster 4; however, Cluster 4 has more 
duplication events than Cluster 1 in monocots. In Cluster 1, two ACS copies are paralogous in 
Chlamydomonas reinhardtii, indicating that duplication occurred early in angiosperm evolution. Three 
ACS copies are also present in Selaginella moellendorffii and are separated by angiosperm sequences, 
which are related to ancient gene duplication. 

The most abundant group, containing 4CLs (Cluster 3), shares the GEICIRG motif. The phylogeny 
based on AMP-dependent synthetase/ligase domain sequences (Figure 1) revealed distinct and 
highly supported clades that group the 4CL proteins into: (1) Dicotyledoneae; (2) Mixed group; and 
(3) Monocotyledoneae. Apart from the early duplication of 4CLs during the evolution of higher plants, 
successive duplications must have occurred among eudicots and monocots, as additional 
copies are present in Arabidopsis (7 copies) and rice (7 copies). Ten and 12 copies are also evident in 
Glycine max and in Zea mays, respectively (Figure 2). The high occurrence of duplication events is 
attributable to frequent genome duplications in angiosperm evolution. 

2.3. Expression of GEICIRG-Containing Proteins in L. japonica Flowers 

We analyzed the transcription of the GEICIRG-containing proteins based on our own data. In 
Cluster 1, the GEICIRG-containing proteins are clustered into two groups and the first group contains 
two pairs of orthologs. In the other group, the paralogs of L. japonica are expressed at low 
levels, and their average reads per kilobase per million mapped reads (RPKM) value is 22.91, which is 
lower than that of the first group. Although there is a similar number of orthologs in L. japonica and 
L. japonica var. chinensis, the collective average RPKM value of the GEICIRG-containing paralogs is 
1.87-fold higher in L. japonica than in L. japonica var. chinensis. A similar trend is found in Cluster 2, and 
other results are summarized in Table 3. 

2.4. Substrate-Binding Diversity in the Expressed GEICIRG-Containing Proteins in L. japonica 

We examined the structures of four GEICIRG-containing proteins in L. japonica, chosen as the most 
highly expressed genes in each cluster (Figure S6; Table S2). Only six conserved 
residues (L-243, H-247, S-316, Y-342, T-345, and E-346, in LJ4CL1) were found, and the T-345 residue 
has the highest frequency of all. Another high-frequency residue, M-332, belonging to the 
B-subdomain (adenylation domain), is relatively conserved [40] but is not found in LJACS2. Thus, 
T-345 may be the conserved residue responsible for the function of the adenylation domain in L. japonica. 

Figure 1. Phylogeny and expression of 4CL sequences in Cluster 3. A neighbor-joining 
tree containing 62 sequences was generated based on the AMP-dependent synthetase/ligase 
domain sequences. Bootstrap analysis with 1000 replicates was applied. The RPKM values 
of sequences in flowers of Lonicera japonica are shown. 

Table 3. Copy number and RPKM of GEICIRG-containing proteins in Lonicera japonica. 

Highly expressed LJ4CL1 is found in the L. japonica bud [30], and it has the following putative 
substrate-binding residues: I-196, Y-197, G-317, G-343, P-349, V-350, and L-351. The substrate-binding 
residues Y-236, G-306, G-331, P-337, and V-338 in 4CL1 of P. tomentosa were identified based on its 
crystal structures, as well as mutagenesis and enzymatic activity studies. The Y residues may relate to Pt4CL1 
activity against caffeic and ferulic acids, but not 4-coumaric acid. Similarly, the G residue may relate 
to Pt4CL1 activity against 4-coumaric and caffeic acids [21]. Schneider et al. [22] also 
reported that 12 amino acid residues, I-252, Y-253, N-256, M-293, K-320, G-322, A-323, G-346, G-348, 
P-354, V-355, and L-356, form the substrate specificity code of At4CL. The N-256 residue is located 
at a distance of 3.1 Å from the hydrogen atom of the 4-hydroxyl group of caffeic acid. The residues 
M-293 and K-320 form a clamp structure around the 3-hydroxyl group of caffeate. However, the 
residues N-256, M-293, and K-320 do not seem to have corresponding residues in LJ4CL1. 

Our results suggest that ferulic acid and 4-coumaric acid are candidate substrates of LJ4CL1. 
To verify our prediction, we analyzed the PDCs of L. japonica flowers using LC-MS. Ten compounds, 
namely, chlorogenic acid, ferulic acid, rutin, hyperoside, isoquercitrin, luteoloside, quercitrin, luteolin, 
quercetin, and apigenin, were identified. Considering that the transferase reaction that couples quinic acid to 
cinnamic acid derivatives is reversible, the CGAs are a storage form of cinnamic acid derivatives and 
are considered intermediates in the lignin biosynthetic pathway [41,42]. Based on their chemical 
structures, the intermediates are classified into two groups. The putative substrates of these two groups 
are 4-coumarate and ferulic acid, consistent with the putative function of LJ4CL1. 

I-196, Y-197, G-317, and G-343 are also observed in LJACS1 and LJACS2, and their frequencies 
are lower than those in LJ4CL. P-349, V-350, and L-351 are not found in LJACSs and LJAAE, 
and they may be conserved residues related to 4-coumarate. The size of the binding pocket is most 
important in determining the substrate specificities of P. tomentosa 4CLs [21]. Our results suggest that 
the diversity of the binding-site residues controls the enzymatic function, although the proteins have 
the same conserved catalytic motif. Ninety percent of the substrates for the known PDCs in L. japonica 
are 4-coumarate, and the substrate specificity may be related to the conserved residues in LJ4CLs. 
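
The residue comparison described above amounts to simple set operations. The sketch below uses only the residues named in the text (one-letter code plus LJ4CL1 position); the variable names are hypothetical, and the actual analysis was based on predicted binding sites from structure models rather than on these hard-coded lists.

```python
# Putative substrate-binding residues of LJ4CL1, as listed in the text.
lj4cl1_residues = {"I196", "Y197", "G317", "G343", "P349", "V350", "L351"}

# Residues the text reports as also observed in LJACS1 and LJACS2.
also_in_acs = {"I196", "Y197", "G317", "G343"}


def by_position(residues):
    """Sort residue labels such as 'G317' by their numeric position."""
    return sorted(residues, key=lambda label: int(label[1:]))


shared = by_position(lj4cl1_residues & also_in_acs)
specific = by_position(lj4cl1_residues - also_in_acs)

print("shared with LJACSs:", shared)
print("LJ4CL1-specific:", specific)
```

The residues left in the difference set (P-349, V-350, and L-351) are the ones the text proposes as candidates related to 4-coumarate.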

To test this prediction, LJ4CL1 protein was expressed in E. coli. The activity of crude LJ4CL1 
protein was analyzed using 4-coumarate as the substrate, and high activity (19.36 U μg⁻¹ protein⁻¹) was 
observed, indicating that 4-coumarate is one of the substrates of LJ4CL1. 

3. Experimental Section 

3.1. Classification of AMP-Dependent Synthetase/Ligase Sequences and GEICIRG-Motif Proteins 

We searched for the AMP-dependent synthetase/ligase sequences of twenty-one species (Table S1) 
using the Pfam [43] and InterPro databases. These species include one animal, one bacterium, 
two fungi, two algae, three gymnospermae, two pteridophyta, seven dicotyledoneae, and three 
monocotyledoneae. We compared the sequences against the sequence "GEICIRG" using BLASTP 
(protein-protein BLAST) [44] with an e-value cut-off to determine the GEICIRG-containing 
proteins from the best reciprocal hits. 

3.2. Phylogeny of AMP-Dependent Synthetase/Ligase Sequences and GEICIRG-Motif Proteins 

We used the AMP-dependent synthetase/ligase domain of AMP-dependent synthetase/ligase 
sequences and GEICIRG-motif proteins to construct neighbor-joining trees using MEGA 5.0 [45] and 
ClustalW2 [46], respectively, with 1000 bootstrap replicates. In addition, we reconciled 
preliminary trees by retaining only branches with bootstrap values greater than 50% to yield a more 
credible consensus tree. 
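
One common way to apply a 50% bootstrap cut-off is to collapse weakly supported internal nodes into polytomies. The sketch below shows this on a toy tree; it is an assumption about what the reconciliation step may look like, not the exact procedure of the tree-building software, and the tree representation (nested dictionaries) and helper names are hypothetical.

```python
def leaf(name):
    """Build a leaf node."""
    return {"name": name, "support": None, "children": []}


def clade(support, *children):
    """Build an internal node with a bootstrap support value in percent."""
    return {"name": None, "support": support, "children": list(children)}


def collapse_low_support(node, threshold=50.0):
    """Return a copy of the tree in which internal nodes whose support is
    not greater than `threshold` are removed and their children are
    attached directly to the parent."""
    new_children = []
    for child in node["children"]:
        child = collapse_low_support(child, threshold)
        is_internal = bool(child["children"])
        support = child["support"]
        if is_internal and support is not None and support <= threshold:
            new_children.extend(child["children"])  # splice grandchildren in
        else:
            new_children.append(child)
    return {"name": node["name"], "support": node["support"], "children": new_children}


# Toy tree: ((A,B)95,(C,D)40); the second clade falls below the cut-off.
tree = clade(None, clade(95, leaf("A"), leaf("B")), clade(40, leaf("C"), leaf("D")))
collapsed = collapse_low_support(tree)
print([c["name"] or "clade" for c in collapsed["children"]])
```

The root is never collapsed because it carries no support value.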

3.3. Identification of Orthologs and Paralogs 

To identify orthologs, we performed an all-against-all sequence comparison using BLAST with an 
e-value cut-off. The orthologs were then determined based on the best reciprocal hits [47]. 
We implemented a more stringent criterion that the alignment length percentage against the longer 
protein must be above 80%. 

3.4. Gene Expression Analyses 

The gene expression profiling of L. japonica flowers was performed in a previous work [30]. The 
expression level was normalized with total mapped reads and the contig length, similar to the reads per 
kilobase of exon model per million mapped reads (RPKM) method [48]. The RPKM value for each 
transcript was calculated as the number of reads per kilobase of the transcript sequence per million 
mapped reads [49]. 
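
The RPKM definition above corresponds to RPKM = reads × 10^9 / (transcript length in bp × total mapped reads). A minimal sketch of that calculation follows; the function name and the example numbers are hypothetical.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2000 bp transcript, 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Normalizing by both length and sequencing depth is what makes values comparable between transcripts and between the two L. japonica samples.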

3.5. Protein Structure and Binding Site Prediction 

The three-dimensional protein structures were predicted from the amino acid sequences using the 
online version of I-TASSER [50]. Based on the C-score and TM-score, the top ten models were 
predicted and the structural analogs with similar binding sites were identified. All of the predicted 
binding site residues in the model were summarized. The diversity of the predicted binding site 
residues of four proteins was analyzed. 

3.6. LC-MS Analysis of L. japonica Flowers 

Dried L. japonica flowers (medicinal materials) were separately ground with a mill. Each 
solid sample (40 mesh, 0.50 g) was accurately weighed and extracted with 50 mL of 70% aqueous 
ethanol with ultrasonication for 30 min. The extract was cooled to 25 °C, diluted to 50 mL with 70% 
aqueous ethanol, and filtered through a 0.45 μm Millipore filter membrane. Then, 10 μL of the filtrate was 
injected into the liquid chromatography-mass spectrometry (LC-MS) system (Agilent RRLC/Agilent 
ion trap 6320, Agilent, Santa Clara, CA, USA) for analysis (Figure S7). The LC-MS/MS system was 
set to a 1.0 mL/min flow rate, and separation was performed on an Agilent TC-C18 reversed-phase column (5 μm, 
250 mm × 4.6 mm). The mobile phases consisted of deionized water-formic acid (99:1, v/v) and 
methanol. The elution conditions were the same as the high-performance liquid chromatography (HPLC, 
Agilent, Santa Clara, CA, USA) conditions used in a previous work [30]. The detection wavelength 
was set to 242 nm, and the column temperature was maintained at 25 °C. All standard compounds 
were purchased from the National Institutes for Food and Drug Control, Beijing, China. 
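
As a quick check on the sample loading implied by this protocol, the sketch below computes the amount of dried material represented by one injection from the values given above (0.50 g in a final volume of 50 mL, 10 μL injected). The function names are hypothetical, and the result is a material equivalent, not a measured analyte amount.

```python
def extract_concentration_mg_per_ml(sample_g, final_volume_ml):
    """Dried-material equivalent per mL of final extract."""
    return sample_g * 1000.0 / final_volume_ml


def injected_equivalent_mg(conc_mg_per_ml, injection_ul):
    """Dried-material equivalent contained in one injection."""
    return conc_mg_per_ml * injection_ul / 1000.0


conc = extract_concentration_mg_per_ml(0.50, 50.0)   # mg dried flower per mL
injected = injected_equivalent_mg(conc, 10.0)         # mg dried flower per injection
print(conc, injected)
```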

3.7. Expression of 4CL Protein in E. coli and Enzyme Activity Assay 

The open reading frame (ORF) of LJ4CL was cloned into the expression vector pGEX-4T-1 and 
transformed into Transetta (DE3) chemically competent cells (Beijing TransGen Biotech Co., Ltd., 
Beijing, China). The vector pGEX-4T-1 (+) allows in-frame cloning of PCR products, 
resulting in a GST-tag attached at the N-terminal end of the recombinant protein. Expression of the 
recombinant protein was induced by adding isopropyl-β-D-1-thiogalactopyranoside (IPTG), and cells 
were harvested at 9 h. The activity of 4CL was analyzed according to Voo et al. [51]. The 1 mL 
reaction mixture contained 50 μL crude enzyme, 0.2 mM 4-coumarate, 0.8 mM ATP, 7.5 mM MgCl2, 
and 38 μM CoA in 100 mM Tris-HCl buffer (pH 7.5). One unit of 4CL was defined as the amount of 
enzyme that causes a decrease in A333 of 0.01 units min⁻¹. Protein concentration in the extracts was 
determined using the Lowry method [52]. 
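
Given the unit definition above (an absorbance change at 333 nm of 0.01 per minute), specific activity can be computed as in the following sketch. The function names and the example readings are hypothetical; the rate is passed as a magnitude so that the sign convention of the absorbance change does not matter.

```python
def enzyme_units(abs_change_per_min, unit_step=0.01):
    """Units of activity from the magnitude of the A333 change per minute."""
    return abs(abs_change_per_min) / unit_step


def specific_activity(abs_change_per_min, protein_ug):
    """Units per microgram of protein in the assay."""
    if protein_ug <= 0:
        raise ValueError("protein amount must be positive")
    return enzyme_units(abs_change_per_min) / protein_ug


# Hypothetical reading: A333 changes by 0.25 per minute with 2.0 ug protein.
example_activity = specific_activity(0.25, 2.0)
print(example_activity)
```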

4. Conclusions 

4CLs form an important enzyme family for the phenylpropanoid-derived pathway in plants. 
Our analysis of AMP-binding and GEICIRG-containing proteins from the genome and transcript 
sequences of 19 species, including an in-house generated dataset containing 40,000 transcript scaffolds 
of L. japonica, allowed us to further exploit 4CL structural (domain and motif) features and validate 
structural predictions based on chemical assays. We also propose the putative substrate-binding 
residues of LJ4CLs and define the major substrate of the PDC pathway in L. japonica. Our study 
paves the way for further studies on 4CLs and their related metabolic pathways in medicinal plants. 